model compression
Linearly Decomposing and Recomposing Vision Transformers for Diverse-Scale Models
Vision Transformers (ViTs) are widely used in a variety of applications, but they usually have a fixed architecture that may not match the varying computational resources of different deployment environments. It is therefore necessary to adapt ViT architectures to devices with diverse computational budgets to achieve a good accuracy-efficiency trade-off. This concept is consistent with the motivation behind Learngene. Inspired by polynomial decomposition in calculus, where a function can be approximated by linearly combining several basic components, we propose to linearly decompose the ViT model into a set of components called learngenes during element-wise training. These learngenes can then be recomposed into differently scaled, pre-initialized models to satisfy different computational resource constraints. Such a decomposition-recomposition strategy provides an economical and flexible approach to generating ViT models of different scales for different deployment scenarios. Compared to model compression or training from scratch, which require repeated training on large datasets to obtain diverse-scale models, this strategy reduces computational cost because it requires training on large datasets only once. Extensive experiments validate the effectiveness of our method: ViTs can be decomposed, and the decomposed learngenes can be recomposed into diverse-scale ViTs that achieve comparable or better performance than traditional model compression and pre-training methods. The code for our experiments is available in the supplemental material.
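The recomposition idea in the abstract above can be illustrated with a small sketch. It is not the paper's implementation: the monomial coefficient schedule, the function names `layer_coefficients` and `recompose`, and the use of plain nested lists for weight matrices are all assumptions made for illustration. The only point it shows is that each layer's weights are a linear combination of a shared set of components, so the same components can initialize models of different depths.

```python
from typing import List

Matrix = List[List[float]]


def layer_coefficients(num_layers: int, num_components: int) -> List[List[float]]:
    """Hypothetical coefficient schedule: monomials of the normalized depth.

    Assumption for illustration only; the abstract does not say which
    basis the authors use.
    """
    coeffs = []
    for layer in range(num_layers):
        # Normalized depth in [0, 1]; a single-layer model sits at depth 0.
        t = layer / (num_layers - 1) if num_layers > 1 else 0.0
        coeffs.append([t ** k for k in range(num_components)])
    return coeffs


def recompose(components: List[Matrix], num_layers: int) -> List[Matrix]:
    """Build one weight matrix per layer as a linear combination of components."""
    coeffs = layer_coefficients(num_layers, len(components))
    rows, cols = len(components[0]), len(components[0][0])
    layers = []
    for layer_coeffs in coeffs:
        weight = [
            [
                sum(c * comp[i][j] for c, comp in zip(layer_coeffs, components))
                for j in range(cols)
            ]
            for i in range(rows)
        ]
        layers.append(weight)
    return layers


# Two toy 1x2 "learngene" components shared by every target depth.
components = [[[1.0, 2.0]], [[10.0, 20.0]]]
shallow = recompose(components, num_layers=2)
deep = recompose(components, num_layers=3)
print(len(shallow), len(deep))
```

Under these assumptions, changing `num_layers` is the only thing needed to obtain a differently scaled initialization from the same components.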
HPM-KD: Hierarchical Progressive Multi-Teacher Framework for Knowledge Distillation and Efficient Model Compression
Haase, Gustavo Coelho, da Silva, Paulo Henrique Dourado
Knowledge Distillation (KD) has emerged as a promising technique for model compression but faces critical limitations: (1) sensitivity to hyperparameters requiring extensive manual tuning, (2) capacity gap when distilling from very large teachers to small students, (3) suboptimal coordination in multi-teacher scenarios, and (4) inefficient use of computational resources. We present HPM-KD, a framework that integrates six synergistic components: (i) Adaptive Configuration Manager via meta-learning that eliminates manual hyperparameter tuning, (ii) Progressive Distillation Chain with automatically determined intermediate models, (iii) Attention-Weighted Multi-Teacher Ensemble that learns dynamic per-sample weights, (iv) Meta-Learned Temperature Scheduler that adapts temperature throughout training, (v) Parallel Processing Pipeline with intelligent load balancing, and (vi) Shared Optimization Memory for cross-experiment reuse. Experiments on CIFAR-10, CIFAR-100, and tabular datasets demonstrate that HPM-KD achieves 10x-15x compression while maintaining 85% accuracy retention, eliminates the need for manual tuning, and reduces training time by 30-40% via parallelization. Ablation studies confirm that each component contributes independently (0.10-0.98 pp). HPM-KD is available as part of the open-source DeepBridge library.
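For readers unfamiliar with the underlying mechanics, the following is a minimal sketch of a temperature-scaled distillation loss with a weighted multi-teacher target. It is not the HPM-KD or DeepBridge API: the function names are hypothetical, the per-sample teacher weights are passed in as plain numbers rather than learned by an attention module, and the temperature is a fixed argument rather than a meta-learned schedule. The loss itself is the standard KL divergence between softened distributions, scaled by the squared temperature.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; subtracts the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_targets(
    teacher_logits: Sequence[Sequence[float]],
    weights: Sequence[float],
    temperature: float,
) -> List[float]:
    """Weighted average of the teachers' softened distributions for one sample.

    `weights` stands in for the learned per-sample weights (assumption).
    """
    total_weight = sum(weights)
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    num_classes = len(probs[0])
    return [
        sum(w * p[i] for w, p in zip(weights, probs)) / total_weight
        for i in range(num_classes)
    ]


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[Sequence[float]],
    weights: Sequence[float],
    temperature: float,
) -> float:
    """KL(ensemble target || student) at the given temperature, times T^2."""
    target = ensemble_targets(teacher_logits, weights, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(target, student) if t > 0.0)
    return temperature ** 2 * kl


# One sample, two teachers, with the first teacher weighted more heavily.
loss = distillation_loss(
    student_logits=[0.5, 0.1, -0.3],
    teacher_logits=[[2.0, 0.0, -1.0], [1.5, 0.5, -0.5]],
    weights=[0.7, 0.3],
    temperature=4.0,
)
print(loss)
```

A higher temperature flattens both distributions, which exposes more of the teachers' relative class preferences to the student; the squared-temperature factor keeps gradient magnitudes comparable across temperatures.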
- North America > United States > District of Columbia > Washington (0.05)
- South America > Brazil > Federal District > Brasília (0.04)
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Poland (0.04)
- North America > United States > Texas > Brazos County > College Station (0.04)
- North America > Canada (0.04)
- Research Report (0.68)
- Overview (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
- North America > United States > District of Columbia > Washington (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > Austria (0.04)
- Asia (0.04)
Reviewer
We thank the reviewers for their helpful comments. The code and models will be open-sourced. Along with peak RAM, we report inference FLOPs for all the models. Finally, the rationale behind ImageNet-10 can be found in Appendix A.1. And even then ReNet's accuracy is GRU or LSTM as the RNN unit; we use GRU as it is more efficient.
A Win-win Deal: Towards Sparse and Robust Pre-trained Language Models
As we described in Section 3.2.2 of the main paper, we realize mask training via binarization in In practice, we control the sparsity in a local way, i.e., all the weight matrices We have introduced the PoE method in Section 3.3. Work was done when Yuanxin Liu was a graduate student of IIE, CAS. We utilize eight datasets from three NLU tasks. Tab. 2 shows the distribution of examples over classes. We use two types of GPU, i.e., Nvidia V100 and TITAN RTX.